{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nemo\n",
    "import nemo.collections.nlp as nemo_nlp\n",
    "from nemo.collections.nlp.data.datasets import BertTextClassificationDataset\n",
    "from nemo.collections.nlp.nm.data_layers.text_classification_datalayer import BertTextClassificationDataLayer\n",
    "from nemo.collections.nlp.nm.trainables import SequenceClassifier\n",
    "\n",
    "from nemo.backends.pytorch.common import CrossEntropyLossNM\n",
    "from nemo.utils.lr_policies import get_lr_policy\n",
    "from nemo.collections.nlp.callbacks.text_classification_callback import eval_iter_callback, eval_epochs_done_callback\n",
    "\n",
    "import os\n",
    "import json\n",
    "import math\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "pd.options.display.max_colwidth = -1\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.manifold import TSNE\n",
    "%matplotlib inline\n",
    "\n",
    "import torch"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook is based on [text_classification_with_bert.py](https://github.com/NVIDIA/NeMo/blob/master/examples/nlp/text_classification/text_classification_with_bert.py)  training script, also see Text Classification documentation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Explore\n",
    "\n",
    "[The SST-2 dataset](https://nlp.stanford.edu/sentiment/index.html) is a standard benchmark for [sentiment classification task](http://nlpprogress.com/english/sentiment_analysis.html). The SST-2 is also a part of the [GLUE Benchmark](https://gluebenchmark.com/tasks). To download the SST-2 dataset as part of GLUE data, you can download [this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e) and then run:\n",
    "\n",
    "`python download_glue_data.py --data_dir glue_data --tasks SST`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After running the script, the SST-2 could be found under ``glue_data/SST-2`` folder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# to get the list of all pretrained BERT Language models run\n",
    "nemo_nlp.nm.trainables.get_pretrained_lm_models_list()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "WORK_DIR = 'output/'\n",
    "DATA_DIR = 'glue_data/SST-2'\n",
    "\n",
    "# To use mixed precision, set AMP_OPTIMIZATION_LEVEL to 'O1' or 'O2',\n",
    "# to train without mixed precision, set it to 'O0'.\n",
    "AMP_OPTIMIZATION_LEVEL = 'O1'\n",
    "# select the model name from the list generated in the previous cell\n",
    "PRETRAINED_MODEL_NAME = 'bert-base-uncased'\n",
    "MAX_SEQ_LEN = 64 # we will pad with 0's shorter sentences and truncate longer\n",
    "BATCH_SIZE = 256 # 64 for 'bert-large-uncased'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(DATA_DIR + '/train.tsv', sep='\\t')\n",
    "test_df = pd.read_csv(DATA_DIR + '/test.tsv', sep='\\t')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dataset comes with a train file (labeled) and a test file (not labeled).  We will use part of the train file for model validation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Split train to train and val and save to disk\n",
    "np.random.seed(123)\n",
    "train_mask = np.random.rand((len(df))) < .8\n",
    "train_df = df[train_mask]\n",
    "val_df = df[~train_mask]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In order to take advantage of NeMo's pre-built text classification data layer, the data should be formatted as \"sentence\\tlabel\" (sentence tab label)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We will add a label column with all 0's (but they will not be used for anything).\n",
    "test_df['label'] = 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_df = test_df[['sentence', 'label']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save new train, val, and test to disk\n",
    "SPLIT_DATA_DIR = os.path.join(DATA_DIR, 'split')\n",
    "\n",
    "os.makedirs(SPLIT_DATA_DIR, exist_ok=True)\n",
    "\n",
    "train_df.to_csv(os.path.join(SPLIT_DATA_DIR, 'train.tsv'), sep='\\t', index=False)\n",
    "val_df.to_csv(os.path.join(SPLIT_DATA_DIR, 'eval.tsv'), sep='\\t', index=False)\n",
    "test_df.to_csv(os.path.join(SPLIT_DATA_DIR, 'test.tsv'), sep='\\t', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Neural Modules\n",
    "\n",
    "In NeMo, everything is a Neural Module. Neural modules abstract data and neural network architectures. Where a deep learning framework like PyTorch or Tensorflow is used to combine neural network layers to create a neural network. \n",
    "NeMo is used to combine data and neural networks to create AI applications.\n",
    "The Neural Module Factory will then manage the neural modules, taking care to flow data through the neural modules, and is also responsible for training (including mixed precision and distributed), logging, and inference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# instantiate the neural module factory\n",
    "nf = nemo.core.NeuralModuleFactory(log_dir=WORK_DIR,\n",
    "                                   create_tb_writer=True,\n",
    "                                   add_time_to_log_dir=False,\n",
    "                                   optimization_level=AMP_OPTIMIZATION_LEVEL)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pre-trained models will be automatically downloaded and cached."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Pre-trained BERT\n",
    "bert = model = nemo_nlp.nm.trainables.get_pretrained_lm_model(pretrained_model_name=PRETRAINED_MODEL_NAME)\n",
    "tokenizer = nemo.collections.nlp.data.tokenizers.get_tokenizer(tokenizer_name='nemobert', pretrained_model_name=PRETRAINED_MODEL_NAME)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note here that the BERT models we are working with are massive. This gives our models a large capacity for learning that is needed to understand the nuance and complexity of natural language."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f'{PRETRAINED_MODEL_NAME} has {bert.num_weights} weights')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we define and instantiate the feed forward network that takes as input our BERT embeddings. This network will be used to output the sentence classifications."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# mlp classifier\n",
    "bert_hidden_size = bert.hidden_size\n",
    "\n",
    "mlp = SequenceClassifier(hidden_size=bert_hidden_size, \n",
    "                         num_classes=2,\n",
    "                         num_layers=2,\n",
    "                         log_softmax=False,\n",
    "                         dropout=0.1)\n",
    "\n",
    "loss = CrossEntropyLossNM()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compared to the BERT model, the MLP is tiny.\n",
    "print(f'MLP has {mlp.num_weights} weights')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Pipelines\n",
    "\n",
    "Pipelines are used to define how data will flow the different neural networks. In this case, our data will flow through the BERT network and then the MLP network.\n",
    "\n",
    "We also have different pipelines for training, validation, and inference data.  \n",
    "\n",
    "For training data, we want it to be used for optimization so it must be shuffled and we also need to compute the loss.\n",
    "\n",
    "For validation data, we won't use it for optimization but we want to know the loss.\n",
    "\n",
    "And for inference data, we only want the final predictions coming from the model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Layers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can gain a lot of efficiency by saving the tokenized data to disk. For future model runs we then don't need to tokenize every time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "USE_CACHE = True\n",
    "\n",
    "train_data = BertTextClassificationDataLayer(input_file=os.path.join(SPLIT_DATA_DIR, 'train.tsv'),\n",
    "                                             tokenizer=tokenizer,\n",
    "                                             max_seq_length=MAX_SEQ_LEN,\n",
    "                                             shuffle=True,\n",
    "                                             batch_size=BATCH_SIZE,\n",
    "                                             use_cache=USE_CACHE)\n",
    "\n",
    "val_data = BertTextClassificationDataLayer(input_file=os.path.join(SPLIT_DATA_DIR, 'eval.tsv'),\n",
    "                                           tokenizer=tokenizer,\n",
    "                                           max_seq_length=MAX_SEQ_LEN,\n",
    "                                           batch_size=BATCH_SIZE,\n",
    "                                           use_cache=USE_CACHE)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_input, train_token_types, train_attn_mask, train_labels = train_data()\n",
    "val_input, val_token_types, val_attn_mask, val_labels = val_data()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## BERT Embeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_embeddings = bert(input_ids=train_input,\n",
    "                        token_type_ids=train_token_types,\n",
    "                        attention_mask=train_attn_mask)\n",
    "val_embeddings = bert(input_ids=val_input,\n",
    "                      token_type_ids=val_token_types,\n",
    "                      attention_mask=val_attn_mask)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Inspect BERT Embeddings\n",
    "\n",
    "If we want to inspect the data as it flows through our neural factory we can use the .infer method.  This method will give us the tensors without performing any optimization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val_input_tensors = nf.infer(tensors=[val_input])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(val_input_tensors[0][0][0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "val_embeddings_tensors = nf.infer(tensors=[val_embeddings])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# each word is embedded into bert_hidden_size space\n",
    "# shape: BATCH_SIZE * MAX_SEQ_LEN * bert_hidden_size\n",
    "val_embeddings_tensors[0][0].shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(val_embeddings_tensors[0][0][1][:, 0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Understanding and Visualizing BERT Embeddings\n",
    "\n",
    "We are going to look at the BERT embeddings for the words (1-word sentences) in \"SPLIT_DATA_DIR/positive_negative.tsv\". Since the BERT embeddings are 768 dimensional for BERT base and 1024 dimensional for BERT large, we'll first apply TSNE and reduce the embeddings to two dimensions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "spectrum_words = ['abysmal', 'apalling', 'dreadful', 'awful', 'terrible',\n",
    "                  'very bad', 'really bad', 'rubbish', 'unsatisfactory',\n",
    "                  'bad', 'poor', 'great', 'really good', 'very good', 'awesome'\n",
    "                  'fantastic', 'superb', 'brilliant', 'incredible', 'excellent'\n",
    "                  'outstanding', 'perfect']\n",
    "\n",
    "spectrum_file = os.path.join(SPLIT_DATA_DIR, 'positive_negative.tsv')\n",
    "with open(spectrum_file, 'w+') as f:\n",
    "    f.write('sentence\\tlabel')\n",
    "    for word in spectrum_words:\n",
    "        f.write('\\n' + word + '\\t0')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "spectrum_df = pd.read_csv(spectrum_file, delimiter='\\t')\n",
    "print(spectrum_df.head())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# positive negative spectrum\n",
    "spectrum_data = BertTextClassificationDataLayer(input_file=spectrum_file,\n",
    "                                                tokenizer=tokenizer,\n",
    "                                                max_seq_length=MAX_SEQ_LEN,\n",
    "                                                batch_size=BATCH_SIZE)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "spectrum_input, spectrum_token_types, spectrum_attn_mask, spectrum_labels = spectrum_data()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "spectrum_embeddings = bert(input_ids=spectrum_input,\n",
    "                           token_type_ids=spectrum_token_types,\n",
    "                           attention_mask=spectrum_attn_mask)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "spectrum_embeddings_tensors = nf.infer(tensors=[spectrum_embeddings])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "spectrum_embeddings_tensors[0][0].shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(100,100))\n",
    "plt.imshow(spectrum_embeddings_tensors[0][0][:,0,:].numpy())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "spectrum_activations = spectrum_embeddings_tensors[0][0][:,0,:].numpy()\n",
    "tsne_spectrum = TSNE(n_components=2, perplexity=10, verbose=1, learning_rate=2,\n",
    "                     random_state=123).fit_transform(spectrum_activations)\n",
    "\n",
    "fig = plt.figure(figsize=(10,10))\n",
    "plt.plot(tsne_spectrum[0:11, 0], tsne_spectrum[0:11, 1], 'rx')\n",
    "plt.plot(tsne_spectrum[11:, 0], tsne_spectrum[11:, 1], 'bo')\n",
    "for (x,y, label) in zip(tsne_spectrum[0:, 0], tsne_spectrum[0:, 1], spectrum_df.sentence.values.tolist() ):\n",
    "    plt.annotate(label, # this is the text\n",
    "                 (x,y), # this is the point to label\n",
    "                 textcoords=\"offset points\", # how to position the text\n",
    "                 xytext=(0,10), # distance from text to points (x,y)\n",
    "                 ha='center') # horizontal alignment can be left, right or center"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training Pipeline \n",
    "\n",
    "In order to optimize our network, we need to pass the embeddings through the MLP network and then compute the loss."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_logits = mlp(hidden_states=train_embeddings)\n",
    "val_logits = mlp(hidden_states=val_embeddings)\n",
    "\n",
    "train_loss = loss(logits=train_logits, labels=train_labels)\n",
    "val_loss = loss(logits=val_logits, labels=val_labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Callbacks\n",
    "\n",
    "Callbacks are used to record and log metrics and save checkpoints for the training and evaluation. We use callbacks to print to screen and also to tensorboard.\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "NUM_EPOCHS = 1\n",
    "NUM_GPUS = 1\n",
    "LEARNING_RATE = 1e-5\n",
    "OPTIMIZER = 'adam'\n",
    "\n",
    "train_data_size = len(train_data)\n",
    "steps_per_epoch = math.ceil(train_data_size / (BATCH_SIZE * NUM_GPUS))\n",
    "\n",
    "train_callback = nemo.core.SimpleLossLoggerCallback(tensors=[train_loss, train_logits],\n",
    "                            print_func=lambda x:nemo.logging.info(f'Train loss: {str(np.round(x[0].item(), 3))}'),\n",
    "                            tb_writer=nf.tb_writer,\n",
    "                            get_tb_values=lambda x: [[\"train_loss\", x[0]]],\n",
    "                            step_freq=steps_per_epoch)\n",
    "\n",
    "eval_callback = nemo.core.EvaluatorCallback(eval_tensors=[val_logits, val_labels],\n",
    "                                            user_iter_callback=lambda x, y: eval_iter_callback(x, y, val_data),\n",
    "                                            user_epochs_done_callback=lambda x:\n",
    "                                                eval_epochs_done_callback(x, f'{nf.work_dir}/graphs'),\n",
    "                                            tb_writer=nf.tb_writer,\n",
    "                                            eval_step=steps_per_epoch)\n",
    "\n",
    "# Create callback to save checkpoints\n",
    "ckpt_callback = nemo.core.CheckpointCallback(folder=nf.checkpoint_dir,\n",
    "                                             epoch_freq=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lr_policy_fn = get_lr_policy('WarmupAnnealing',\n",
    "                             total_steps=NUM_EPOCHS * steps_per_epoch,\n",
    "                             warmup_ratio=0.1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "nf.train(tensors_to_optimize=[train_loss],\n",
    "         callbacks=[train_callback, eval_callback, ckpt_callback],\n",
    "         lr_policy=lr_policy_fn,\n",
    "         optimizer=OPTIMIZER,\n",
    "         optimization_params={'num_epochs': NUM_EPOCHS, 'lr': LEARNING_RATE})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Multi-Gpu Training\n",
    "\n",
    "RESTART KERNEL BEFORE RUNNING THE MULTI-GPU TRAINING"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "num_gpus = 4\n",
    "!python -m torch.distributed.launch --nproc_per_node=$NUM_GPUS text_classification_with_bert.py \\\n",
    "--pretrained_model_name $PRETRAINED_MODEL_NAME \\\n",
    "--data_dir $SPLIT_DATA_DIR \\\n",
    "--train_file_prefix 'train' \\\n",
    "--eval_file_prefix 'eval' \\\n",
    "--use_cache \\\n",
    "--batch_size $BATCH_SIZE \\\n",
    "--max_seq_length 64 \\\n",
    "--num_gpus $NUM_GPUS \\\n",
    "--num_epochs $NUM_EPOCHS \\\n",
    "--amp_opt_level $AMP_OPTIMIZATION_LEVEL \\\n",
    "--work_dir $WORK_DIR"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Inference Pipeline\n",
    "\n",
    "For inference we instantiate the same neural modules but now we will be using the checkpoints that we just learned."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_data = BertTextClassificationDataLayer(input_file=os.path.join(SPLIT_DATA_DIR, 'test.tsv'),\n",
    "                                            tokenizer=tokenizer,\n",
    "                                            max_seq_length=MAX_SEQ_LEN,\n",
    "                                            batch_size=BATCH_SIZE)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_input, test_token_types, test_attn_mask, _ = test_data()\n",
    "test_embeddings = bert(input_ids=test_input,\n",
    "                        token_type_ids=test_token_types,\n",
    "                        attention_mask=test_attn_mask)\n",
    "test_logits = mlp(hidden_states=test_embeddings)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "test_logits_tensors = nf.infer(tensors=[test_logits])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_probs = torch.nn.functional.softmax(torch.cat(test_logits_tensors[0])).numpy()[:, 1] "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_df = pd.read_csv(os.path.join(SPLIT_DATA_DIR, 'test.tsv'), sep='\\t')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_df['prob'] = test_probs \n",
    "inference_file = os.path.join(SPLIT_DATA_DIR, 'test_inference.tsv')\n",
    "test_df.to_csv(inference_file, sep='\\t', index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def sample_classification(data_path):\n",
    "    df = pd.read_csv(data_path, sep='\\t')\n",
    "    sample = df.sample()\n",
    "    sentence = sample.sentence.values[0]\n",
    "    prob = sample.prob.values[0]\n",
    "    result = f'{sentence} | {prob}'\n",
    "    return result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_samples = 10\n",
    "for _ in range(num_samples):\n",
    "    print(sample_classification(inference_file))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Inference Results:\n",
    "the film is just a big , gorgeous , mind-blowing , breath-taking mess . | 0.2738656\n",
    "\n",
    "a sensual performance from abbass buoys the flimsy story , but her inner journey is largely unexplored and we 're left wondering about this exotic-looking woman whose emotional depths are only hinted at . | 0.48260054"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Single sentence classification"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def classify_sentence(nf, tokenizer, bert, mlp, sentence):\n",
    "    sentence = sentence.lower()\n",
    "    tmp_file = \"/tmp/tmp_sentence.tsv\"\n",
    "    with open(tmp_file, 'w+') as tmp_tsv:\n",
    "        header = 'sentence\\tlabel\\n'\n",
    "        line = sentence + '\\t0\\n'\n",
    "        tmp_tsv.writelines([header, line])\n",
    "\n",
    "    tmp_data = BertTextClassificationDataLayer(input_file=tmp_file,\n",
    "                                               tokenizer=tokenizer,\n",
    "                                               max_seq_length=128,\n",
    "                                               batch_size=1)\n",
    "    \n",
    "    tmp_input, tmp_token_types, tmp_attn_mask, _ = tmp_data()\n",
    "    tmp_embeddings = bert(input_ids=tmp_input,\n",
    "                          token_type_ids=tmp_token_types,\n",
    "                          attention_mask=tmp_attn_mask)\n",
    "    tmp_logits = mlp(hidden_states=tmp_embeddings)\n",
    "    tmp_logits_tensors = nf.infer(tensors=[tmp_logits, tmp_embeddings])\n",
    "    tmp_probs = torch.nn.functional.softmax(torch.cat(tmp_logits_tensors[0])).numpy()[:, 1] \n",
    "    print(f'{sentence} | {tmp_probs[0]}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sentences = ['point break is the best movie of all time',\n",
    "             'the movie was a wonderful exercise in understanding the struggles of native americans',\n",
    "             'the performance of diego luna had me excited and annoyed at the same time',\n",
    "             'matt damon is the only good thing about this film']\n",
    "\n",
    "for sentence in sentences:\n",
    "    classify_sentence(nf, tokenizer, bert, mlp, sentence)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Understanding and Visualizing BERT Embeddings\n",
    "\n",
    "Now that we've fine-tuned our BERT model, let's see if the word embeddings have changed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "spectrum_embeddings = bert(input_ids=spectrum_input,\n",
    "                           token_type_ids=spectrum_token_types,\n",
    "                           attention_mask=spectrum_attn_mask)\n",
    "\n",
    "spectrum_embeddings_tensors = nf.infer(tensors=[spectrum_embeddings])\n",
    "\n",
    "plt.figure(figsize=(100,100))\n",
    "plt.imshow(spectrum_embeddings_tensors[0][0][:,0,:].numpy())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "spectrum_activations = spectrum_embeddings_tensors[0][0][:,0,:].numpy()\n",
    "tsne_spectrum = TSNE(n_components=2, perplexity=10, verbose=1, learning_rate=2,\n",
    "                     random_state=123).fit_transform(spectrum_activations)\n",
    "\n",
    "fig = plt.figure(figsize=(10,10))\n",
    "plt.plot(tsne_spectrum[0:11, 0], tsne_spectrum[0:11, 1], 'rx')\n",
    "plt.plot(tsne_spectrum[11:, 0], tsne_spectrum[11:, 1], 'bo')\n",
    "for (x,y, label) in zip(tsne_spectrum[0:, 0], tsne_spectrum[0:, 1], spectrum_df.sentence.values.tolist() ):\n",
    "    plt.annotate(label, # this is the text\n",
    "                 (x,y), # this is the point to label\n",
    "                 textcoords=\"offset points\", # how to position the text\n",
    "                 xytext=(0,10), # distance from text to points (x,y)\n",
    "                 ha='center') # horizontal alignment can be left, right or center"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  },
  "pycharm": {
   "stem_cell": {
    "cell_type": "raw",
    "metadata": {
     "collapsed": false
    },
    "source": []
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
